Scalable Multi-Framework Multi-Tenant Lifecycle Management of Deep Learning Training Jobs
Authors
Abstract
With the ongoing rise and phenomenal success of machine learning (ML), particularly deep learning, efficient training of large neural network models in scalable cloud infrastructures becomes a priority. ML workloads have traditionally been run in high-performance computing (HPC) environments, where users log in to dedicated machines and utilize the attached GPUs to run jobs that train models on huge datasets. Providing a similar user experience in a multi-tenant cloud environment comes with its own unique challenges regarding fault tolerance, performance, and security. We tackle these challenges and present a deep learning stack specifically designed for on-demand cloud environments. Based on a detailed discussion of the system architecture, we examine real usage data from internal users, and discuss performance experiments that illustrate the scalability of the system.
Similar resources
OPTiC: Opportunistic Graph Processing in Multi-Tenant Clusters
We present OPTiC, a multi-tenant scheduler intended for distributed graph processing frameworks. OPTiC proposes opportunistic scheduling, whereby queued jobs can be pre-scheduled at cluster nodes when the cluster is fully busy running jobs. This allows overlapping of data ingress with ongoing computation. To pre-schedule wisely, OPTiC’s novel contribution is a profile-free and cluster-agnostic ...
SaaS Multi-Tenancy: Framework, Technology, and Case Study
SaaS (Software as a Service) provides new business opportunities for application providers to serve more customers in a scalable and cost-effective way. SaaS also raises new challenges, one of which is multi-tenancy. Multi-tenancy is the requirement of deploying only one shared application to serve multiple customers (i.e., tenants) instead of deploying one dedicated application for each custom...
Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture
Due to diversified and frequently changing demands from customers, technological advances and global competition, manufacturers rely on collaboration with their business partners to share costs, risks and expertise. How to take advantage of advancement of technologies to effectively support operations and create competitive advantage is critical for manufacturers to survive. To respond to these...
Tempo: Robust and Self-Tuning Resource Management in Multi-tenant Parallel Databases
Multi-tenant database systems have a component called the Resource Manager (RM) that is responsible for allocating resources to tenants. RMs today do not provide direct support for performance objectives such as “Average job response time of tenant A must be less than two minutes” or “No more than 5% of tenant B’s jobs can miss the deadline of 1 hour.” Thus, DBAs have to tinker with the RM’...
Stochastic Variational Deep Kernel Learning
Deep kernel learning combines the non-parametric flexibility of kernel methods with the inductive biases of deep learning architectures. We propose a novel deep kernel learning model and stochastic variational inference procedure which generalizes deep kernel learning approaches to enable classification, multi-task learning, additive covariance structures, and stochastic gradient training. Spec...
Publication date: 2017